Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language

Abstract

End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch accent in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS. We investigate the effects of features captured by PnG BERT on Japanese TTS by modifying the fine-tuning condition to determine the conditions helpful for inferring pitch accents. We manipulate the content of PnG BERT features from being text-oriented to speech-oriented by changing the number of fine-tuned layers during TTS. In addition, we teach PnG BERT pitch accent information by fine-tuning with tone prediction as an additional downstream task. Our experimental results show that the features of PnG BERT captured by pretraining contain information helpful for inferring pitch accent, and PnG BERT outperforms the baseline Tacotron on accent correctness in a listening test.

Similar resources

Automatic pitch accent prediction for text-to-speech synthesis

Determining pitch accents in a sentence is a key task for a text-to-speech (TTS) system. We describe some methods for pitch accent assignment which make use of features that contain information about a complete phrase or sentence, in contrast to most previous work which has focused on using features local to a syllable or word. Pitch accent prediction is performed using three different technique...

Toward Language-independent Text-to-speech Synthesis

Text-to-speech (TTS) synthesis is becoming a fundamental part of any embedded system that has to interact with humans. Language-independence in speech synthesis is a primary requirement for systems that are not practical to update, as is the case for most embedded systems. Because current text-to-speech synthesis usually refers to a single language and to a single speaker (or at most a limited ...

CRF-based statistical learning of Japanese accent sandhi for developing Japanese text-to-speech synthesis systems

In Japanese, every content word has its own H/L pitch pattern when it is uttered in isolation, called its accent type. In a TTS system, this lexical information is usually stored in a dictionary and is referred to for prosody generation. When converting a written sentence to speech, however, this lexical H/L pattern is often changed according to the context, a phenomenon known as word accent sandhi. This accen...

A Markup Language for Text-to-speech Synthesis (Richard Sproat)

Text-to-speech synthesizers must process text, and therefore require some knowledge of text structure. While many TTS systems allow for user control by means of ad hoc ‘escape sequences’, there remains to date no adequate and generally agreed upon system-independent standard for marking up text for the purposes of synthesis. The present paper is a collaborative effort between two speech groups ...

Journal

Journal title: IEEE Journal of Selected Topics in Signal Processing

Year: 2022

ISSN: 1941-0484, 1932-4553

DOI: https://doi.org/10.1109/jstsp.2022.3190672